{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9b399a34",
   "metadata": {},
   "source": [
    "# 在SecretFlow中加载Numpy数据"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11b8eb49",
   "metadata": {},
   "source": [
    "The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bbd8384",
   "metadata": {},
   "source": [
    "本教程将展示下，怎样在SecretFlow的多方安全环境中加载numpy数据。  \n",
    "SecretFlow支持`npy`，`npz`多种格式，接口封装和`numpy`保持一致"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8383ac24",
   "metadata": {},
   "source": [
    "## 环境设置"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8ee31441",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bb86c839",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-17 16:47:03,538\tINFO worker.py:1538 -- Started a local Ray instance.\n"
     ]
    }
   ],
   "source": [
    "import secretflow as sf\n",
    "\n",
    "# Check the version of your SecretFlow\n",
    "print('The version of SecretFlow: {}'.format(sf.__version__))\n",
    "\n",
    "# In case you have a running secretflow runtime already.\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=True)\n",
    "alice, bob ,charlie = sf.PYU('alice'), sf.PYU('bob') , sf.PYU('charlie')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e4f9827",
   "metadata": {},
   "source": [
    "## 接口介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf743273",
   "metadata": {},
   "source": [
    "我们在SecretFlow中提供了类似于`numpy.load`的接口`secretflow.data.ndarray.load`来将各方数据的`ndarray`读取成为一个联邦概念的数据。  \n",
    "\n",
    "通过secretflow.data.load可以读取多方的numpy文件，构成一个`FedNdarray`\n",
    "\n",
    "接口介绍：[secretflow.data.load](https://www.secretflow.org.cn/docs/secretflow/en/source/secretflow.data.html#secretflow.data.ndarray.load)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "947c30c9",
   "metadata": {},
   "source": [
    "## 数据下载与分割"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8b53809f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E0417 16:47:08.224327619   13264 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers\n"
     ]
    }
   ],
   "source": [
    "%%capture\n",
    "%%!\n",
    "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz\n",
    "pip install opencv-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fb1ca464",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "all_data = np.load(\"./mnist.npz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20ccb301",
   "metadata": {},
   "source": [
    "对数据进行拆分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0ff318ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_train_x = all_data[\"x_train\"][:30000]\n",
    "alice_test_x = all_data[\"x_test\"][:30000]\n",
    "alice_train_y = all_data[\"y_train\"][:30000]\n",
    "alice_test_y = all_data[\"y_test\"][:30000]\n",
    "\n",
    "bob_train_x = all_data[\"x_train\"][30000:]\n",
    "bob_test_x = all_data[\"x_test\"][30000:]\n",
    "bob_train_y = all_data[\"y_train\"][30000:]\n",
    "bob_test_y = all_data[\"y_test\"][30000:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41652edb",
   "metadata": {},
   "source": [
    "分别保存成npz格式文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a8ece9aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.savez(\"./alice_mnist.npz\",train_x = alice_train_x, test_x = alice_test_x, train_y=alice_train_y, test_y = alice_test_y)\n",
    "np.savez(\"./bob_mnist.npz\",train_x = bob_train_x, test_x = bob_test_x, train_y=bob_train_y, test_y = bob_test_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a4ab72d",
   "metadata": {},
   "source": [
    "将alice和bob的train_x保存成npy格式。方便后文npy格式读取使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0d29231e",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.save(\"./alice_mnist_train_x.npy\",alice_train_x)\n",
    "np.save(\"./bob_mnist_train_x.npy\",bob_train_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd135e73",
   "metadata": {},
   "source": [
    "## 加载npz文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1e8e7578",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_path = \"./alice_mnist.npz\"\n",
    "bob_path = \"./bob_mnist.npz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "67fc91ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "from secretflow.data.ndarray import load\n",
    "from secretflow.data.split import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "bcca6eff",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_npz = load({alice:alice_path, bob:bob_path},allow_pickle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a66aa798",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_npz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cde8331",
   "metadata": {},
   "source": [
    "FedNpz的每一个value是是FedNdarray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f9e378ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "secretflow.data.ndarray.FedNdarray"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(fed_npz[\"train_x\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e044e91",
   "metadata": {},
   "source": [
    "## 加载npy文件"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "166a2c68",
   "metadata": {},
   "source": [
    "加载npy就很简单了，直接调用load接口读取出来就是一个标准的FedNdarray对象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c718b1c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_path = \"./alice_mnist_train_x.npy\"\n",
    "bob_path = \"./bob_mnist_train_x.npy\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1fb4fe11",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_ndarray = load({alice:alice_path, bob:bob_path},allow_pickle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0f21b86c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "secretflow.data.ndarray.FedNdarray"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(fed_ndarray)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "936ba7ff",
   "metadata": {},
   "source": [
    "## 应该怎样将我已有的数据转成FedNdarray再进行读取？"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "177fd75c",
   "metadata": {},
   "source": [
    "那我们怎么样将其他类型的数据转成FedNdarray 数据呢？  \n",
    "比如我有一个图像数据集，语音数据集，我该怎么样通过FedNdarray将数据传入联邦模型  \n",
    "我们这里以花卉分类数据集Flower来举个例子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e4158ea5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-17 11:41:31.422425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:31.456434: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2023-04-17 11:41:32.352006: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:32.352101: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:32.352111: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz\n",
      "228813984/228813984 [==============================] - 2s 0us/step\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "import tensorflow as tf\n",
    "\n",
    "_temp_dir = tempfile.mkdtemp()\n",
    "path_to_flower_dataset = tf.keras.utils.get_file(\n",
    "    \"flower_photos\",\n",
    "    \"https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz\",\n",
    "    untar=True,\n",
    "    cache_dir=_temp_dir,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2e9c468",
   "metadata": {},
   "source": [
    "下载解压后根目录存在 Root = \"flower_photos\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "e11441a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3670, 240, 240, 3)\n",
      "(3670,)\n",
      "alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)\n",
      "bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)\n"
     ]
    }
   ],
   "source": [
    "import os, glob\n",
    "import numpy as np\n",
    "import cv2 #依赖需要自行安装 pip install opencv-python\n",
    "\n",
    "root = path_to_flower_dataset\n",
    "classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']\n",
    "img_paths = []  # 保存所有图片路径\n",
    "labels = []  # 保存图片的类别标签(0,1,2,3,4)\n",
    "for i, label in enumerate(classes):\n",
    "    cls_img_paths = glob.glob(os.path.join(root, label, \"*.jpg\"))\n",
    "    img_paths.extend(cls_img_paths)\n",
    "    labels.extend([i] * len(cls_img_paths))\n",
    "\n",
    "# 图片->numpy\n",
    "img_numpys = []\n",
    "labels = np.array(labels)\n",
    "for img_path in img_paths:\n",
    "    img_numpy = cv2.imread(img_path)\n",
    "    img_numpy = cv2.resize(img_numpy, (240, 240))\n",
    "    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))\n",
    "    # If use Pytorch backend dimension should be exchanged\n",
    "    # img_numpy = np.transpose(img_numpy, (0,3,1,2))\n",
    "    img_numpys.append(img_numpy)\n",
    "\n",
    "images = np.concatenate(img_numpys, axis=0)\n",
    "print(images.shape)  \n",
    "print(labels.shape)  \n",
    "\n",
    "# 给两个节点分配images和labels，各分配50%的数据\n",
    "per=0.5\n",
    "alice_images=images[:int(per*images.shape[0]),:,:,:]\n",
    "alice_label=labels[:int(per*images.shape[0])]\n",
    "bob_images=images[int(per*images.shape[0]):,:,:,:]\n",
    "bob_label=labels[int(per*images.shape[0]):]\n",
    "print(f\"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}\")  \n",
    "print(f\"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}\")  \n",
    "\n",
    "# 分别保存为npz，然后发送给两台机器\n",
    "np.savez(\"flower_alice.npz\",image=alice_images,label=alice_label)\n",
    "np.savez(\"flower_bob.npz\",image=bob_images,label=bob_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "942c44d6",
   "metadata": {},
   "source": [
    "得到需要的NPZ之后，使用上面介绍过的Load函数读取成FedNdarray，输入到模型中即可开始训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "fb97921c",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_flower_npz = load({alice:\"./flower_alice.npz\",bob:\"./flower_bob.npz\"},allow_pickle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "b851ee52",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_flower_npz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "0f76e617",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_image = fed_flower_npz[\"image\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "368fa687",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_image.partition_shape()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d17ec85c",
   "metadata": {},
   "source": [
    "## 小建议"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c2dd4a8",
   "metadata": {},
   "source": [
    "建议将数据转为ndarray类型之后，使用单机版训练引擎进行测试，检查数据格式是否正确匹配模型。然后再使用隐语的联邦框架进行测试，可以提高排查效率.  \n",
    "*注意：在使用图像数据集是要注意维度排列*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
